Abstract

The analysis of several algorithms and data structures can be framed as a peeling process on a random
hypergraph: vertices with degree less than k are removed until there are no vertices of degree less than
k left. The remaining hypergraph is known as the k-core. In this paper, we analyze parallel peeling
processes, where in each round, all vertices of degree less than k are removed. It is known that, below
a specific edge density threshold, the k-core is empty with high probability. We show that, with high
probability, below this threshold, only $\frac{1}{\log((k-1)(r-1))} \log\log n + O(1)$ rounds of peeling are needed to
obtain the empty k-core for r-uniform hypergraphs. Interestingly, we show that above this threshold,
$\Omega(\log n)$ rounds of peeling are required to find the non-empty k-core. Since most algorithms and data
structures aim to peel to an empty k-core, this asymmetry appears fortunate. We verify the theoretical
results both with simulation and with a parallel implementation using graphical processing units (GPUs).
Our implementation provides insights into how to structure parallel peeling algorithms for efficiency in
practice.

1 Introduction

Consider the following peeling process: starting with a random hypergraph, vertices with degree less than
k are repeatedly removed, together with their associated edges. This yields what is called the k-core of the
hypergraph, which is the maximal subgraph where each vertex has degree at least k. This greedy peeling
process, and variations on it, have found applications in low-density parity-check codes [6, 9], hash-based
sketches [4], satisfiability of random boolean formulae [2, 11], and cuckoo hashing [12]. Frequently, the
question in these settings is whether or not the k-core is empty. As we discuss further below, it is known
that below a specific edge density threshold, the k-core is empty with high probability. This asymptotic
result in fact accurately predicts practical performance quite well: peeling produces algorithms with very
fast running times, generally linear in the size of the graph.

Because of its simplicity and effectiveness, peeling-based approaches appear potentially very useful for
problems involving large data sets. In this paper, we focus on expanding this applicability by examining the
use of parallelism in conjunction with peeling. Peeling seems particularly amenable to parallel processing
via the following simple round-based algorithm: in each round, all vertices of degree less than k and their
adjacent edges are removed in parallel from the graph. The major question we study is: how many rounds
are necessary before peeling is complete?
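The round-based algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used for the experiments: the names `random_hypergraph` and `parallel_peel` are hypothetical, edges are drawn with `random.sample` (so duplicate edges are possible), and a round is counted only when it removes at least one edge, so vertices of degree zero are ignored.

```python
import random


def random_hypergraph(n, c, r, seed=0):
    """Draw int(c * n) hyperedges, each a tuple of r distinct vertices in range(n)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n), r)) for _ in range(int(c * n))]


def parallel_peel(num_vertices, edges, k=2):
    """Round-based peeling; returns (rounds_used, number_of_edges_left).

    In each round, every vertex whose current degree is between 1 and k - 1
    is removed together with all of its incident edges. The batch of edges
    to delete is fixed before any deletion, so removals within a round are
    simultaneous. The k-core is empty exactly when no edges are left.
    """
    incident = [set() for _ in range(num_vertices)]  # vertex -> live edge ids
    for eid, edge in enumerate(edges):
        for v in edge:
            incident[v].add(eid)
    live_edges = len(edges)
    rounds = 0
    while True:
        doomed = set()
        for v in range(num_vertices):
            if 0 < len(incident[v]) < k:
                doomed |= incident[v]
        if not doomed:
            return rounds, live_edges
        for eid in doomed:
            for v in edges[eid]:
                incident[v].discard(eid)
        live_edges -= len(doomed)
        rounds += 1
```

For example, `parallel_peel(n, random_hypergraph(n, 0.7, 4), k=2)` runs one trial in the setting of the later experiments; averaging the returned round count over many seeds gives the kind of statistic reported there.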

We show that, with high probability, below the edge density threshold where the k-core is empty, only
$\frac{1}{\log((k-1)(r-1))} \log\log n + O(1)$ rounds of peeling are needed for r-uniform hypergraphs. Specifically, we
show that the fraction of vertices that remain in each round decreases doubly exponentially, in a manner
similar in spirit to existing analyses of load-balancing problems [1, 7]. Interestingly, we show in contrast
that at edge densities above the threshold, with high probability $\Omega(\log n)$ rounds of peeling are required to
find the non-empty k-core. Since most algorithms and data structures that use peeling aim for an empty
k-core, the fact that empty k-cores are faster to find in parallel than non-empty ones appears fortuitous: the
algorithms will take O(log log n) rounds when they succeed.

We then consider some of the details in implementation, focusing on the algorithmic example of Invertible
Bloom Lookup Tables (IBLTs) [4]. An IBLT stores a set of keys, with each key being hashed into r
cells in a hash table, and all keys in a cell XORed together. The IBLT defines a random hypergraph, where
keys correspond to edges, and cells to vertices. As we describe later, recovering the set of keys from the
IBLT corresponds to peeling on the associated hypergraph. Applications of IBLTs are further discussed in
[4]: they can be used, for example, for sparse recovery [4] and simple low-density parity-check codes [9].

Perhaps surprisingly, we cannot find an analysis of parallel peeling in the literature, although early work
by Karp, Luby, and Meyer auf der Heide on PRAM simulation uses an algorithm similar to peeling to
obtain O(log log n) bounds for load balancing [5], and we use other load balancing arguments [1, 7] for
inspiration. We also rely heavily on the framework established by Molloy [11] for analyzing the k-core of
random hypergraphs.



2 The High-Level Argument for the k-Core

For constants $r \ge 3$ and c, let $G^r_{n,cn}$ denote a random hypergraph with n vertices and cn hyperedges, where
each hyperedge consists of r distinct vertices. Previous analyses of random hypergraphs have determined
the threshold values $c^*_{k,r}$ such that when $c < c^*_{k,r}$, the k-core is empty with probability 1 - o(1), and when
$c > c^*_{k,r}$, the k-core is non-empty with probability 1 - o(1). For example, it follows from [11] that

$$c^*_{2,r} = \min_{x>0} \frac{x}{r(1-e^{-x})^{r-1}},$$

and we find $c^*_{2,3} \approx 0.818$.
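As a numerical check of the threshold formula, the sketch below minimizes the objective by ternary search. Several things here are assumptions of the sketch rather than statements from the text: the function name `core_threshold`, the search interval, the unimodality of the objective that ternary search relies on, and the general-k form of the objective, in which the factor $1 - e^{-x}$ is replaced by $1 - e^{-x}\sum_{j=0}^{k-2} x^j/j!$ (for k = 2 this reduces to the formula above).

```python
import math


def core_threshold(k, r):
    """Approximate min over x > 0 of x / (r * (1 - e^{-x} * sum_{j<=k-2} x^j/j!)^(r-1))."""

    def objective(x):
        partial = sum(x ** j / math.factorial(j) for j in range(k - 1))  # j = 0..k-2
        return x / (r * (1.0 - math.exp(-x) * partial) ** (r - 1))

    lo, hi = 1e-3, 50.0  # assumed bracket for the minimizer
    for _ in range(200):  # ternary search, assuming a single interior minimum
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return objective((lo + hi) / 2)
```

With these assumptions, `core_threshold(2, 3)` agrees with the value 0.818 quoted above, and `core_threshold(2, 4)` agrees with the value 0.772 used in the experiments.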

The neighborhood of a vertex v can be accurately modeled as a branching process, with a random number of
edges adjacent to this vertex, and similarly a random number of edges adjacent to each of those vertices,
and so on. For intuition, we assume this branching process yields a tree, and further that the number of
adjacent edges is distributed according to a discrete Poisson distribution with mean rc. These assumptions
are sufficiently accurate for our analysis, as we later prove. (See e.g. Edll for similar arguments.)

The intuition for the main result comes from considering the (tree) neighborhood of v, and applying the
following algorithm: for $1 \le i \le t-1$, in round i, look at all the vertices at distance t - i and delete a vertex
if it has fewer than k - 1 child edges. Finally, in round t, v is deleted if it has degree less than k. Vertex v
survives after t rounds of parallel peeling if and only if it survives after t rounds of this algorithm.

In what follows, we denote the probability that v survives after t rounds in this model by $\lambda_t$, and the
probability a vertex u at distance t - i from v survives i rounds by $\rho_i$.

Here $\rho_0 = 1$. In this idealized setting, the following relationships hold:

$$\rho_i = \Pr(\mathrm{Poisson}(\rho_{i-1}^{r-1} rc) \ge k-1),$$

and similarly

$$\lambda_i = \Pr(\mathrm{Poisson}(\rho_{i-1}^{r-1} rc) \ge k).$$

The recursion for $\rho_i$ arises as follows: each node u has a Poisson distributed number of descendant edges
with mean rc, and each edge has r - 1 additional vertices that survive i - 1 rounds with probability $\rho_{i-1}$. By
the splitting property of Poisson distributions [8, Chapter 5], the number of surviving descendant edges of
u is Poisson distributed with mean $\rho_{i-1}^{r-1} rc$, and this must be at least k - 1 for u to itself survive the ith round.
For convenience, we define

$$\beta_i = \rho_{i-1}^{r-1} rc.$$

Then,

$$\rho_i = 1 - e^{-\beta_i} \sum_{j=0}^{k-2} \frac{\beta_i^j}{j!}, \qquad
\lambda_i = 1 - e^{-\beta_i} \sum_{j=0}^{k-1} \frac{\beta_i^j}{j!},$$

$$\beta_{i+1} = \left(1 - e^{-\beta_i} \sum_{j=0}^{k-2} \frac{\beta_i^j}{j!}\right)^{r-1} rc.$$
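The recursion is easy to iterate numerically. The following sketch is illustrative only: the names `poisson_tail` and `idealized_survival` are hypothetical, and the Poisson tail is summed term by term (an implementation choice of this sketch) so that it stays accurate when the mean becomes very small.

```python
import math


def poisson_tail(mean, k, terms=120):
    """Pr(Poisson(mean) >= k), computed as a truncated sum of the tail terms."""
    term = math.exp(-mean) * mean ** k / math.factorial(k)
    total = 0.0
    for j in range(k, k + terms):
        total += term
        term *= mean / (j + 1)
    return total


def idealized_survival(r, k, c, rounds):
    """Return [(rho_i, lambda_i)] for i = 1..rounds in the branching-process model."""
    values = []
    rho_prev = 1.0  # rho_0 = 1
    for _ in range(rounds):
        beta = rho_prev ** (r - 1) * r * c   # beta_i = rho_{i-1}^{r-1} * r * c
        rho = poisson_tail(beta, k - 1)      # survive as a non-root vertex
        lam = poisson_tail(beta, k)          # survive as the root
        values.append((rho, lam))
        rho_prev = rho
    return values
```

For r = 4, k = 2, c = 0.7 and n = 1 million, the first two values of $\lambda_t n$ round to 768922 and 673647, which are the first two predictions listed for c = 0.7 in Table 2.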



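When $c < c^*_{k,r}$, which is the setting where we know the core becomes empty, we have $\lim_{t\to\infty} \rho_t = 0$, so
$\lim_{t\to\infty} \beta_t = 0$. Thus, for any constant $\tau > 0$, we can choose a constant I such that $\beta_I \le \tau$.

For any $x \ge 0$ and $k \ge 2$, by basic calculus, we have
$1 - e^{-x} \sum_{j=0}^{k-2} \frac{x^j}{j!} \le \frac{x^{k-1}}{(k-1)!}$. Applying this bound to $\beta_{I+1}$ gives

$$\beta_{I+1} \le \left(\frac{\beta_I^{k-1}}{(k-1)!}\right)^{r-1} rc = \frac{rc}{[(k-1)!]^{r-1}}\, \beta_I^{(k-1)(r-1)}.$$

Using induction, we can show that

$$\beta_{I+t} \le \left(\frac{rc}{[(k-1)!]^{r-1}}\right)^{\frac{[(k-1)(r-1)]^t - 1}{(k-1)(r-1) - 1}} \tau^{[(k-1)(r-1)]^t}.$$

If $\frac{rc}{[(k-1)!]^{r-1}} > 1$, we can apply the upper bound
$\beta_{I+t} \le \left[\tau \left(\frac{rc}{[(k-1)!]^{r-1}}\right)^{\frac{1}{(k-1)(r-1)-1}}\right]^{[(k-1)(r-1)]^t}$,
and if $\frac{rc}{[(k-1)!]^{r-1}} \le 1$, then $\beta_{I+t} \le \tau^{[(k-1)(r-1)]^t}$. Setting
$\tau' = \max\left(\tau \left(\frac{rc}{[(k-1)!]^{r-1}}\right)^{\frac{1}{(k-1)(r-1)-1}}, \tau\right)$ gives

$$\beta_{I+t} \le (\tau')^{[(k-1)(r-1)]^t}.$$

Pick $\tau$ such that $\tau' < 1$. The definition of $\lambda_i$ and applying the same upper bound gives

$$\lambda_{I+t} \le \frac{\beta_{I+t}^k}{k!} \le \frac{(\tau')^{k[(k-1)(r-1)]^t}}{k!}.$$

Requiring the right-hand side to fall below an inverse polynomial in n gives
$t > \frac{1}{\log((k-1)(r-1))} \log\log n + O(1)$. This shows that it takes
$t^* = \frac{1}{\log((k-1)(r-1))} \log\log n + O(1)$ rounds for $\lambda_{t^*} n = o(1)$ in our idealized setting.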
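To see how slowly the required number of rounds grows with n in the idealized setting, one can iterate the recursion until the expected number of surviving vertices drops below one. The sketch below does this; the stopping rule $\lambda_t n < 1$, the helper names `tail` and `rounds_until_empty`, and the term-by-term summation of the Poisson tail are assumptions made for illustration.

```python
import math


def tail(mean, k, terms=120):
    """Pr(Poisson(mean) >= k) as a truncated sum, accurate for tiny means."""
    term = math.exp(-mean) * mean ** k / math.factorial(k)
    total = 0.0
    for j in range(k, k + terms):
        total += term
        term *= mean / (j + 1)
    return total


def rounds_until_empty(n, r, k, c, max_rounds=10_000):
    """Smallest t with lambda_t * n < 1 in the idealized recursion.

    Only meaningful below the threshold; above it lambda_t converges to a
    positive constant and the loop ends by raising ValueError.
    """
    rho = 1.0
    for t in range(1, max_rounds + 1):
        beta = rho ** (r - 1) * r * c
        if tail(beta, k) * n < 1.0:
            return t
        rho = tail(beta, k - 1)
    raise ValueError("lambda_t * n never dropped below 1")
```

For r = 4, k = 2 and c = 0.7, this returns 12 for n = 10^4 and 13 for n = 10^6: a hundredfold increase in n costs a single extra round, in line with the doubly exponential decay derived above.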
3 Completing the Argument



To formalize the argument above, we first note that instead of working in the $G^r_{n,cn}$ model, we adopt the
standard approach of having each hyperedge independently in the graph with probability $q = cn/\binom{n}{r}$. It can
be shown easily that the result in this model yields the result in the $G^r_{n,cn}$ model (see e.g. Ill [HI).
We also note the following lemma, which we make use of throughout.

Lemma 1. For any constant $c_1 > 0$, there is a constant $c_2 > 0$ such that with high probability, for all vertices
v, the neighborhood of distance $c_1 \log\log n$ around v contains at most $\log^{c_2} n$ vertices.

Proof. The result appears in the dissertation of Voll [14, Lemma 3.3.1]. Denoting by $N_d$ the number of
vertices at distance d in the neighborhood of a root vertex u for a Poisson branching process with constant
mean c > 1, he proves that for $\varepsilon < 0.029$,

Pr(A^rf>^log(l/£)(3c)'^)<^/£. 

Choosing $\varepsilon$ to be a suitable inverse polynomial in n, he finds that the number of vertices in the
neighborhood of distance up to a log log n around such a vertex is polylogarithmic with probability 1 - o(1/n),
and the same holds with branching trees under the corresponding binomial branching process, and for
neighborhoods around vertices of random graphs. Here, because we work with hypergraphs, each edge gives
r - 1 new child vertices instead of 1, but this only affects the calculations by constant factors. We conclude
that the number of vertices in the neighborhood of a vertex in our hypergraph model is polylogarithmic with
probability 1 - o(1/n). □

It will help us to introduce some terminology. We will recursively refer to a vertex other than the root
as peeled in round i if it has fewer than k - 1 unpeeled descendant edges at the beginning of the round;
similarly, an edge is peeled if some adjacent vertex is peeled. We say unpeeled for an edge or vertex that
is not peeled. At round 0, all edges and vertices begin as unpeeled. For the root, we require fewer than k
unpeeled descendant edges before it is peeled.

We now prove the following theorem: 

Theorem 1. Let $r \ge 3$ and $k \ge 2$. With probability 1 - o(1), the parallel peeling process for the k-core in a
random hypergraph with r-ary edges terminates after $\frac{1}{\log((k-1)(r-1))} \log\log n + O(1)$ rounds when $c < c^*_{k,r}$.

Proof. Deviations from Poisson: We focus on how the process deviates from the idealized branching
process, showing it leads to only lower order effects. We view the branching process as generating a breadth
first search from the initial vertex. Note that, once a vertex is expanded in the breadth first search, it cannot
be the child of a subsequent vertex. Also, the number of edges adjacent to a vertex is not Poisson distributed,
but a binomial random variable, as each edge appears independently with probability $q = cn/\binom{n}{r}$. Let $Z_u$
be the number of already expanded vertices in the breadth first search when expanding a node u. Then the
number of edges adjacent to u is a binomial random variable with mean $\binom{n-Z_u}{r-1} q$. By Lemma 1, with high
probability the neighborhood over the O(log log n) levels we require is polylogarithmic in n, and conditioned
on this event, the deviations introduced by $Z_u$ affect our estimates of $\beta_i$ and $\lambda_i$ by sublinear amounts. By a
straightforward analysis using Le Cam's Theorem, we can bound the deviations in $\beta_i$ and $\lambda_i$ introduced by
the difference between our idealized Poisson model and the actual binomial distribution by an amount that is
polynomially small in n. As long as $\beta_i$ and $\lambda_i$ remain polynomially larger than this error, the recursive
equations for $\beta_i$ and $\lambda_i$ are accurate up to 1 + o(1) factors with high probability.

Two issues remain. First, we have to get $\lambda_t$ down fully to o(1/n), not just to the polynomially small
level at which the recursion remains accurate, so we can apply a union bound over all the vertices. We first
show how to do this assuming the neighborhood is a tree. Second, we may have duplicated vertices as we
expand the neighborhood. When duplicates appear, parts of our neighborhood tree expansion are no longer
independent, as in our idealized analysis. We show how to modify the analysis to cope with duplicate vertices.

Bounding $\lambda_t$ for Trees: Under the assumption that the neighborhood is a tree, we can get the error
probability down to o(1/n).

Note that for the root to be unpeeled after i rounds, there must be at least $k \ge 2$ adjacent unpeeled edges,
corresponding to at least 4 (distinct, from our tree assumption) unpeeled children vertices after i - 1
rounds, since $r \ge 3$. Correspondingly, there must be at least 8 (distinct) unpeeled vertices after i - 2 rounds,
since each of the 4 unpeeled children of the root must have one adjacent unpeeled edge with 2 vertices that
are grandchildren of the root. We have shown that each vertex remains unpeeled after
$t^* = \frac{1}{\log((k-1)(r-1))} \log\log n + O(1)$ rounds with only polynomially small probability. These
8 unpeeled vertices can be chosen from the at most polylogarithmic number of grandchildren, which gives
only polylog(n) possible choices. Hence, via a union bound, the probability that v survives t* + 2 rounds is
at most polylog(n) times the product of these 8 survival probabilities, which is o(1/n). We can take a union
bound over all vertices for our final 1 - o(1) bound.

Dealing with duplicate vertices: Finally, we now explain that, with probability 1 - o(1), we need worry
only about a single duplicate vertex in the neighborhood for all vertices, and further that this only adds an
additive constant to the number of parallel rounds required. As we expand the neighborhood using breadth
first search, the probability of a duplicate vertex occurring during any expansion step is only polylog(n)/n.
As the neighborhood contains only a polylogarithmic number of vertices, the probability of having at least
two duplicate vertices within the neighborhood of any vertex is o(1/n).

It is therefore sufficient to show that having one duplicate vertex in the neighborhood only adds a constant
number of rounds to the parallel peeling process. Consider what happens when a vertex u is duplicated.
This creates an obvious dependence, which we would like to avoid in our argument. Hence, if we encounter a
duplicate, we pessimistically assume that it prevents two vertices adjacent to the root from peeling. Even
with this assumption, we show that simply adding one additional layer of expansion in the neighborhood
allows the root to peel by round t* + 3 with high probability.

Consider what happens in t* + 3 rounds when there is 1 duplicate vertex. In order for the root to remain
unpeeled, it must have at least 4 (not necessarily distinct) vertices at distance 1 that are unpeeled at round
t* + 2. At most two of these four vertices have a descendant that is a duplicate or is itself a duplicate. Hence,
for the root to remain unpeeled for at least t* + 2 rounds, at least two vertices must remain unpeeled out of
the polylog(n) children of the root, when the neighborhood of these vertices for t* + 2 rounds is a tree. By
our previous calculations, the probability that this occurs is o(1/n), completing the proof. □

Remark: One can obtain better than 1 - o(1) bounds on the probability of terminating after
$\frac{1}{\log((k-1)(r-1))} \log\log n + O(1)$ rounds when $c < c^*_{k,r}$. For example, 1 - o(1/n) bounds are possible
when r > 3; the argument requires considering cases for the possibility that 2 vertices are duplicated in the
neighborhood around a vertex. However, one cannot hope for probability bounds of $1 - o(1/n^a)$ for an
arbitrary constant a when duplicate edges may appear, as is typical for hashing applications. The probability
the k-core is not empty because k edges share the same r vertices is $\Omega(n^{r+k-rk})$ for constant k, r, and
graphs with a linear number of hyperedges, which is already $\Omega(1/n)$ for k = 2 and r = 3.

4 Above the Threshold

Now consider the case when $c > c^*_{k,r}$. We show that parallel peeling requires $\Omega(\log n)$ rounds in this case.

Molloy [11] showed that in this case there exists a $\rho > 0$ such that $\lim_{t\to\infty} \rho_t = \rho$. Similarly,
$\lim_{t\to\infty} \beta_t = \beta > 0$ and $\lim_{t\to\infty} \lambda_t = \lambda > 0$. It follows that the core will have size
$\lambda n + o(n)$. We examine how $\beta_t$ and $\lambda_t$ approach their limiting values to show that the parallel
peeling algorithm takes $\Omega(\log n)$ rounds.

Theorem 2. Let $r \ge 3$ and $k \ge 2$. With probability 1 - o(1), the parallel peeling process for the k-core in a
random hypergraph with r-ary edges terminates after $\Omega(\log n)$ rounds when $c > c^*_{k,r}$.

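Proof. First, let $\beta_i = \beta + \varepsilon$ for some $\varepsilon > 0$. We begin by working in the idealized branching process model
given in Section 2 to determine the behavior of $\beta_t$.

Note $\beta$ corresponds to the fixed point

$$\beta = \left(1 - e^{-\beta} \sum_{j=0}^{k-2} \frac{\beta^j}{j!}\right)^{r-1} rc.$$

Let $S_{k-2} = \sum_{j=0}^{k-2} \frac{\beta^j}{j!}$ and $S_{k-3} = \sum_{j=0}^{k-3} \frac{\beta^j}{j!}$, where $S_{k-3} = 0$ if k = 2. Then,

$$\sum_{j=0}^{k-2} \frac{(\beta+\varepsilon)^j}{j!} = S_{k-2} + S_{k-3}\varepsilon + O(\varepsilon^2).$$

Hence,

$$\begin{aligned}
\beta_{i+1} &= \left[1 - e^{-\beta-\varepsilon}\left(S_{k-2} + S_{k-3}\varepsilon + O(\varepsilon^2)\right)\right]^{r-1} rc \\
&= \left[1 - e^{-\beta}S_{k-2} + e^{-\beta}S_{k-2}\left(1 - e^{-\varepsilon} - e^{-\varepsilon}\frac{S_{k-3}}{S_{k-2}}\varepsilon + O(\varepsilon^2)\right)\right]^{r-1} rc \\
&= \left[1 - e^{-\beta}S_{k-2} + e^{-\beta}S_{k-2}\left(\left(1 - S_{k-3}/S_{k-2}\right)\varepsilon + O(\varepsilon^2)\right)\right]^{r-1} rc \\
&= \left[1 - e^{-\beta}S_{k-2}\right]^{r-1} rc + (r-1)\left(1 - e^{-\beta}S_{k-2}\right)^{r-2} e^{-\beta}S_{k-2}\left[\left(1 - S_{k-3}/S_{k-2}\right)\varepsilon + O(\varepsilon^2)\right] rc + O(\varepsilon^2) \\
&= \beta + \alpha\varepsilon + O(\varepsilon^2).
\end{aligned}$$

The penultimate line uses the Binomial Theorem. Here
$\alpha = (r-1)\left(1 - e^{-\beta}S_{k-2}\right)^{r-2} e^{-\beta}\left(S_{k-2} - S_{k-3}\right) rc$ is a constant with $\alpha < 1$.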

Next, we know that $\lambda = 1 - e^{-\beta}S_{k-1}$, where $S_{k-1} = \sum_{j=0}^{k-1} \frac{\beta^j}{j!}$. By the same calculation
as above, it can be shown that

$$\lambda_i = \lambda + e^{-\beta}S_{k-1}(1 - S_{k-2}/S_{k-1})\varepsilon + O(\varepsilon^2),$$
$$\lambda_{i+1} = \lambda + e^{-\beta}S_{k-1}(1 - S_{k-2}/S_{k-1})\alpha\varepsilon + O(\varepsilon^2).$$

Hence, for suitably small $\varepsilon$ values, in each round $\lambda_i$ gets closer to $\lambda$ by at most some constant factor
$\alpha' = \alpha + O(\varepsilon)$ under the idealized model. Under the random hypergraph model, suitable martingale
concentration arguments apply (see e.g. [2, 11]), and errors in the recursion due to duplicate vertices in the
neighborhood do not have a significant effect. That is, for $t = \gamma \log n$ for a suitably small $\gamma$, we can show
that for all i with $1 \le i \le t$, concentration bounds yield that $\lambda_i$ is close to its expectation. Further, the
effect of duplicate vertices on $\lambda_t$ can be made polynomially small in n. We choose $\gamma$ so that
$t = \Omega(\log n)$ but the gap between $\lambda_t$ and $\lambda$ in the idealized model is still polynomially larger than
this error. It follows that in the $G^r_{n,cn}$ model, polynomially many vertices outside the core still survive
after $t = \Omega(\log n)$ parallel rounds when $c > c^*_{k,r}$. □
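The geometric convergence above the threshold can be checked numerically. The sketch below iterates the idealized recursion to its fixed point and evaluates the linearization coefficient as the coefficient of $\varepsilon$ in the expansion of $\beta_{i+1}$. The function names are hypothetical and the fixed iteration count is an assumption chosen to be comfortably large.

```python
import math


def partial_exp_sum(x, m):
    """sum_{j=0}^{m} x^j / j!, with the empty sum (m < 0) equal to 0."""
    return sum(x ** j / math.factorial(j) for j in range(m + 1))


def fixed_point_and_rate(r, k, c, iters=2000):
    """Return (beta, lambda, alpha) at the limit of the idealized recursion."""
    beta = r * c  # beta_1, since rho_0 = 1
    for _ in range(iters):
        rho = 1.0 - math.exp(-beta) * partial_exp_sum(beta, k - 2)
        beta = rho ** (r - 1) * r * c
    s_k2 = partial_exp_sum(beta, k - 2)
    s_k3 = partial_exp_sum(beta, k - 3)
    rho = 1.0 - math.exp(-beta) * s_k2
    # Coefficient of epsilon in beta_{i+1} = beta + alpha * epsilon + O(epsilon^2).
    alpha = (r - 1) * rho ** (r - 2) * math.exp(-beta) * (s_k2 - s_k3) * r * c
    lam = 1.0 - math.exp(-beta) * partial_exp_sum(beta, k - 1)
    return beta, lam, alpha
```

For r = 4, k = 2 and c = 0.85, the limiting value of $\lambda$ is close to 0.77501, consistent with the c = 0.85 column of Table 2 levelling off near 775010 vertices out of one million, and the rate $\alpha$ is roughly 0.53, so each round shrinks the remaining gap by only a constant factor.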

Remark: While our experimental evidence and the recursion lead us to conjecture that there is a corresponding
O(log n) upper bound when above the threshold $c^*_{k,r}$, we have not found a straightforward way to prove
this fact. For our purposes, it is the distinction in behavior of the parallel algorithm above and below the
threshold that is important, but we believe it would also be useful to have the upper bound in this case.
              c = 0.7           c = 0.75          c = 0.8           c = 0.85
n          Failed  Rounds    Failed  Rounds    Failed  Rounds    Failed  Rounds
10000         0    12.504       0    23.352     1000   17.037     1000   10.773
20000         0    12.594       0    23.433     1000   19.028     1000   11.928
40000         0    12.791       0    23.343     1000   20.961     1000   12.992
80000         0    12.939       0    23.372     1000   22.959     1000   14.104
160000        0    12.983       0    23.421     1000   25.066     1000   15.005
320000        0    13.000       0    23.491     1000   27.089     1000   16.305
640000        0    13.000       0    23.564     1000   29.281     1000   17.334
1280000       0    13.000       0    23.716     1000   31.037     1000   18.499
2560000       0    13.000       0    23.840     1000   33.172     1000   19.570

Table 1: Results of simulations using r = 4 and k = 2, over 1000 trials.



5 Experimental Results 

We implemented a simulation of the parallel peeling algorithm, using the $G^r_{n,cn}$ model where each edge is
chosen independently and uniformly from the set of $\binom{n}{r}$ possible edges. To check the growth of the number
of rounds as a function of n, we ran the program 1000 times for r = 4, k = 2 and various values of n and c,
and computed the average number of rounds for the peeling process to complete. For reference, $c^*_{2,4} \approx 0.772$.
Table 1 shows the results.

For all the experiments, when $c < c^*_{2,4}$, all 1000 trials succeeded (empty k-core) and when $c > c^*_{2,4}$, all
1000 trials failed (non-empty k-core). This is unsurprising given the distance from the asymptotic threshold.
For $c < c^*_{2,4}$, the average number of rounds increases very slowly with n, while for $c > c^*_{2,4}$, the average
increases approximately linearly in log n. This is in accord with our O(log log n) result below the threshold
and $\Omega(\log n)$ result above the threshold. The results for other values of r and k were similar.

Another experiment tested how well the idealized values from the recurrence for $\lambda_t$ approximate
the fraction of vertices left after t rounds. In the following tests, we used r = 4, k = 2 and n = 1 million.
For each value of c, we averaged over 1000 trials. Table 2 shows that the recurrence indeed describes the
behavior of the peeling process quite well, both below and above the threshold.

6 GPU Implementation 

6.1 Motivation 

Using a graphics processing unit (GPU), we developed a parallel implementation for Invertible Bloom
Lookup Tables (IBLTs), a data structure recently proposed by Goodrich and Mitzenmacher [4]. Two
motivating applications are sparse recovery [4] and efficiently encodable and decodable error correcting codes
[9]. For brevity we describe here only the sparse recovery application.

In the sparse recovery problem, N items are inserted into a set S, and subsequently all but n of the items
are deleted. The goal is to recover the exact set S, using space proportional to the final number of items n,
which can be much smaller than the total number of items N that were ever inserted. IBLTs achieve this
roughly as follows. The IBLT maintains O(n) cells, where each cell contains a key field and a checksum
field. We use r hash functions $h_1, \ldots, h_r$. When an item x is inserted or deleted from S, we consider the r
cells $h_1(x), \ldots, h_r(x)$, and we XOR the key field of each of these cells with x, and we XOR the checksum field
of each of these cells with checkSum(x), where checkSum is some simple pseudorandom function. Notice
that the insertion and deletion procedures are identical.

            c = 0.7                          c = 0.85
 t    Prediction   Experiment        t    Prediction   Experiment
 1      768922       768925          1      853158       853172
 2      673647       673664          2      811184       811200
 3      608076       608097          3      793026       793042
 4      553064       553091          4      784269       784281
 5      500466       500503          5      779841       779851
 6      444828       444872          6      777550       777559
 7      380873       380930          7      776350       776359
 8      302531       302607          8      775719       775728
 9      204442       204550          9      775385       775394
10       93245        93398         10      775209       775218
11       14159        14269         11      775115       775124
12          74           78         12      775066       775074
13     0.00001            0         13      775039       775048
14           0            0         14      775025       775034
15           0            0         15      775018       775026
16           0            0         16      775014       775022
17           0            0         17      775012       775020
18           0            0         18      775011       775019
19           0            0         19      775010       775018
20           0            0         20      775010       775018

Table 2: Results of simulations on how well the recursion approximates the number of vertices left after t
rounds. The experiments are run using r = 4, k = 2, n = 1 million, over 1000 trials.

In order to recover the set S, we iteratively look for "pure" cells; these are cells that contain only one
item x in the final set S. Every time we find a pure cell whose key field is x, we recover x and delete x from
S, which hopefully creates new pure cells. We continue until there are no more pure cells, or we have fully
recovered the set S.

The IBLT defines a random r-uniform hypergraph G, in which vertices correspond to cells in the IBLT, 
and edges correspond to items in the set S. Pure cells in the IBLT correspond to vertices of degree less 
than k = 2. The IBLT recovery procedure precisely corresponds to a peeling process on G, and the recovery 
procedure is successful if and only if the 2-core of G is empty. 
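The insertion, deletion and recovery procedures described above can be sketched serially as follows. This is a minimal illustration rather than the GPU implementation: the class name `IBLT`, the methods `toggle` and `recover`, the use of `hashlib.blake2b` for both the cell hashes and checkSum, the restriction to integer keys below 2^64, and the layout with one subtable per hash function (so that the r cells of an item are distinct) are all assumptions of the sketch.

```python
import hashlib


def _hash64(data: bytes, salt: int) -> int:
    """64-bit hash of `data`, with `salt` selecting an independent function."""
    digest = hashlib.blake2b(data, digest_size=8,
                             salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


class IBLT:
    def __init__(self, cells_per_subtable, r=3):
        self.r = r
        self.m = cells_per_subtable
        self.key_sum = [0] * (r * cells_per_subtable)    # XOR of keys per cell
        self.check_sum = [0] * (r * cells_per_subtable)  # XOR of checksums per cell

    def _cells(self, x):
        data = x.to_bytes(8, "little")
        return [j * self.m + _hash64(data, j) % self.m for j in range(self.r)]

    @staticmethod
    def _checksum(x):
        return _hash64(x.to_bytes(8, "little"), 255)

    def toggle(self, x):
        """Insert or delete x; both operations are the same XOR update."""
        c = self._checksum(x)
        for cell in self._cells(x):
            self.key_sum[cell] ^= x
            self.check_sum[cell] ^= c

    def recover(self):
        """Peel pure cells until none remain; return (items, fully_recovered)."""
        items = set()
        progress = True
        while progress:
            progress = False
            for cell in range(len(self.key_sum)):
                x = self.key_sum[cell]
                if x == 0 and self.check_sum[cell] == 0:
                    continue  # empty cell
                # A cell is treated as pure if its checksum matches its key
                # and the key really hashes to this cell.
                if self.check_sum[cell] == self._checksum(x) and cell in self._cells(x):
                    items.add(x)
                    self.toggle(x)  # deleting x may expose new pure cells
                    progress = True
        done = not any(self.key_sum) and not any(self.check_sum)
        return items, done
```

A typical use is to call `toggle` for every inserted and every deleted item and then call `recover` once; recovery succeeds when the 2-core of the hypergraph defined by the remaining items is empty.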

6.2 Implementation Details 

Our parallel IBLT implementation consists of two stages: the insertion/deletion stage, during which items 
are inserted and deleted from the IBLT, and the recovery phase. Both phases can be parallelized. 

One method of parallelizing the insertion/deletion phase is as follows: we devote a separate thread to 
each item to be inserted or deleted. A caveat is that multiple threads may try to modify a single cell at 
any point in time, and so we have to use atomic XOR operations, so that threads trying to write to the 
same cell do not interfere with each other. In general, atomic operations can be a bottleneck in any parallel 
implementation; if t threads try to write to the same memory location, the algorithm will take at least t 
(serial) time steps. 

We parallelize the recovery phase as follows. We proceed in rounds, and in each round we devote a single
thread to each cell in the IBLT. Each thread checks if its cell is pure, and if so it identifies the item contained
in the cell, removes all r occurrences of the item from the IBLT, and marks the cell as recovered. The
implementation proceeds until it reaches an iteration where no items are recovered - this can be checked
by summing up (in parallel) the number of cells marked recovered after each round, and stopping when
this number does not change. This procedure also requires atomic XOR operations, as two threads may
simultaneously try to write to the same cell if there are two or more items x, y recovered in the same round
such that $h_i(x) = h_i(y)$ for some $1 \le i \le r$.

In addition, we must take care to avoid deleting an item multiple times from the IBLT. Indeed, since any
item x inserted into the IBLT is placed into r cells, x might be contained in multiple pure cells at any instant,
and the thread devoted to each such pure cell may try to delete x. To prevent this, we split the table up into r
subtables, and hash each item into one cell in each subtable upon insertion and deletion. When we execute
the recovery algorithm, we iterate through the subtables serially (which requires r serial steps per round),
processing each subtable in parallel. This ensures that an item x only gets removed from the table once, since
the first time a pure cell is found containing x, x gets removed from all the other subtables. This recovery
procedure corresponds to an interesting variant of the peeling process we analyze further in Section 7.

6.3 Experimental Results 

All of our serial code was written in C++ and all experiments were compiled with g++ using the -O3
compiler optimization flag and run on a workstation with a 64-bit Intel Xeon architecture and 48 GB of
RAM. We implemented all of our GPU code in CUDA with all compiler optimizations turned on, and ran
our GPU implementation on an NVIDIA Tesla C2070 GPU with 6 GB of device memory.

Summary of results. Relative to our serial implementation, our GPU implementation achieves 10x-12x
speedups for the insertion/deletion phase, and 20x speedups for the recovery stage when the edge density of
the hypergraph is below the threshold for successful recovery (i.e. empty 2-core). When the edge density
is slightly above the threshold for successful recovery, our parallel recovery implementation was only about
7x faster than our serial implementation. The reason for this is two-fold. Firstly, above the threshold, many
more rounds of the parallel peeling process were necessary before the 2-core was found. Secondly, above
the threshold, less work was required of the serial implementation because fewer items were recovered; in
contrast, the parallel implementation examines every cell every round.

Our detailed experimental results are given in Tables 3 (for the case of r = 3 hash functions) and 4
(for the case of r = 4 hash functions). The timing results are averages over 10 trials each. For the GPU
implementation, the reported times do account for the time to transfer data (i.e. the items to be inserted) from
the CPU to the GPU.

The reported results are for a fixed IBLT size, consisting of $2^{24}$ cells. These results are representative
for all sufficiently large input sizes: beyond a certain table size, the runtime of our parallel implementation
grows roughly linearly with the number of table cells (for any fixed table load).
Here, table load refers to the ratio of the number of items in the IBLT to the number of cells in the IBLT.
This corresponds to the value c in the corresponding hypergraph. The linear increase in runtime above a
certain input size is typical, and is due to the fact that there is a finite number of threads that the GPU can
launch at any one time.

Table   No. Table       %           GPU             Serial          GPU           Serial
Load    Cells           Recovered   Recovery Time   Recovery Time   Insert Time   Insert Time
0.75    16.8 million    100%        0.33 s          6.37 s          0.31 s        3.91 s
0.83    16.8 million    50.1%       0.42 s          3.64 s          0.35 s        4.34 s

Table 3: Results of our parallel and serial IBLT implementations with r = 3 hash functions. The table load
refers to the ratio of the number of items in the IBLT to the number of cells in the IBLT.

Table   No. Table       %           GPU             Serial          GPU           Serial
Load    Cells           Recovered   Recovery Time   Recovery Time   Insert Time   Insert Time
0.75    16.8 million    100%        0.47 s          8.37 s          0.42 s        4.55 s
0.83    16.8 million    24.6%       0.25 s          2.28 s          0.46 s        5.0 s

Table 4: Results of our parallel and serial IBLT implementations with r = 4 hash functions. The table load
refers to the ratio of the number of items in the IBLT to the number of cells in the IBLT.

7 Parallel Peeling with Subtables 

The parallel peeling process used in our GPU implementation of IBLTs in Section 6 does not precisely
correspond to the one analyzed in Sections 3 and 4. The differences are two-fold. First, the underlying
hypergraph G in our IBLT implementation is not chosen uniformly from all r-uniform hypergraphs; instead,
vertices in G (i.e., IBLT cells) are partitioned into r equal-sized sets (or subtables) of size n/r, and edges
are chosen at random subject to the constraint that each edge contains exactly one vertex from each set.
Second, the peeling process in our GPU implementation does not attempt to peel all vertices in each round.
Instead, our GPU implementation proceeds in subrounds, where each round consists of r subrounds. In the
jth subround of a given round, we remove all the vertices of degree less than k in the jth subtable. Note that
running one round of this algorithm is not equivalent to running one round of the original parallel peeling
algorithm. This is because peeling the first subtable may free up new peelable vertices in the second subtable,
and so on. Hence, running one round of the algorithm used in our GPU implementation may remove more
vertices than running one round of the original algorithm.
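The subround structure can be sketched as follows. The function name `peel_with_subtables`, the representation of an edge as a tuple holding one cell index per subtable, and the convention of reporting the last subround that removed an edge are assumptions made for illustration; the sketch is serial and only mimics the order in which a parallel implementation would process the subtables.

```python
def peel_with_subtables(edges, r, k=2):
    """Peel subtable by subtable; returns (last_useful_subround, edges_left).

    edges[e][j] is the cell of edge e in subtable j. Each round consists of
    r subrounds; subround j removes every edge incident to a cell of
    subtable j whose current degree is between 1 and k - 1.
    """
    incident = [dict() for _ in range(r)]  # per subtable: cell -> live edge ids
    for eid, edge in enumerate(edges):
        for j in range(r):
            incident[j].setdefault(edge[j], set()).add(eid)
    live = len(edges)
    subrounds = 0
    last_useful = 0
    while True:
        removed_this_round = 0
        for j in range(r):
            subrounds += 1
            doomed = set()
            for cell_edges in incident[j].values():
                if 0 < len(cell_edges) < k:
                    doomed |= cell_edges
            # Deleting an edge updates all r subtables, so later subrounds in
            # the same round already see the reduced degrees.
            for eid in doomed:
                for h in range(r):
                    incident[h][edges[eid][h]].discard(eid)
            if doomed:
                last_useful = subrounds
                removed_this_round += len(doomed)
        live -= removed_this_round
        if removed_this_round == 0:
            return last_useful, live
```

On the three edges (0, 0, 0), (1, 0, 1), (1, 1, 1) with r = 3, the first subround removes the first edge and the second subround removes the other two, so peeling finishes within a single round; removing all low-degree vertices simultaneously would need two full rounds on the same input.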

In this section, we analyze the peeling process used in our GPU implementation. We can use a similar
approach as above to obtain the recursion for the survival probabilities for this algorithm. Let $\rho_{i,j}$ be the
probability that a vertex in the tree survives i rounds when it is in the jth subtable, with each $\rho_{0,j} = 1$. Then,

$$\rho_{i,j} = \Pr\Big(\mathrm{Poisson}\Big(\prod_{h<j} \rho_{i,h} \prod_{h>j} \rho_{i-1,h}\, rc\Big) \ge k-1\Big).$$

By the same reasoning,

$$\lambda_{i,j} = \Pr\Big(\mathrm{Poisson}\Big(\prod_{h<j} \rho_{i,h} \prod_{h>j} \rho_{i-1,h}\, rc\Big) \ge k\Big)$$

and we can consider

$$\beta_{i,j} = \prod_{h<j} \rho_{i,h} \prod_{h>j} \rho_{i-1,h}\, rc.$$

                 c = 0.7                c = 0.75
n          Failed   Subrounds     Failed   Subrounds
10000               26.018                 47.732
20000               26.142                 47.659
40000               26.273                 47.666
80000               26.452                 47.783
160000              26.585                 47.769
320000              26.790                 47.925
640000              26.957                 48.070
1280000             27.006                 48.141
2560000             27.012                 48.175

Table 5: Results of simulations of peeling with subtables using r = 4 and k = 2, over 1000 trials.

These equations differ from our original equation in a way similar to how the equations for standard
multiple-choice load-balancing differ from Vöcking's asymmetric variation of multiple-choice load-balancing,
where a hash table is similarly split into r subtables, each item is given one choice by hashing in each subtable,
and the item is placed in the least loaded subtable, breaking ties according to some fixed ordering of the 
subtables HI [13. 

Motivated by this, we can show that in this variation, below the threshold, these values eventually
decrease "Fibonacci exponentially", that is, with the exponent falling according to a generalized Fibonacci
sequence. We follow the same approach as outlined in Section 2. Let $\beta'_m = \beta_{i,j}$ where $m = (i-1)r + j$, and
similarly for $\lambda'_m$ and $\rho'_m$, so we may work in a single dimension. Let $F_{r-1}(i)$ represent the $i$th number in a
Fibonacci sequence of order $r-1$. We choose a constant $I$ so that $\beta'_{I+a} < \tau$ for an appropriate constant
$\tau$ and $0 \le a \le r-2$. Then,



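$$
\begin{aligned}
\beta'_{I+r} &= rc \prod_{I<j<I+r} \rho'_j \\
&\le rc \prod_{I<j<I+r} \frac{(\beta'_j)^{k-1}}{(k-1)!} \\
&\le \frac{rc}{[(k-1)!]^{r-1}} \prod_{I<j<I+r} (\beta'_j)^{k-1} \\
&\le \frac{rc}{[(k-1)!]^{r-1}} \prod_{0<j<r} (\beta'_{I+j})^{k-1}.
\end{aligned}
$$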



We can induct to see that the exponent of the bounding constant in the $\beta'$ values falls according to a
generalized Fibonacci sequence of order $r-1$, leading to an asymptotic constant factor reduction in the number
of overall rounds, even as we have to work over a larger number of subrounds.
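
To get a feel for this growth, the recursion implied by the bound above can be iterated directly. The sketch below tracks only the exponent of the bounding constant: it assumes the first r − 1 values each carry exponent 1, and it ignores the constant factor rc/[(k−1)!]^(r−1). The function name `bound_exponents` is hypothetical.

```python
def bound_exponents(r, k, count):
    """Exponents e_m with e_m = (k - 1) * (sum of the previous r - 1 exponents).

    The first r - 1 exponents are set to 1 (an assumption of this sketch),
    matching r - 1 starting values that each lie below the same constant.
    For k = 2 this is a Fibonacci-type sequence of order r - 1.
    """
    exps = [1] * (r - 1)
    while len(exps) < count:
        exps.append((k - 1) * sum(exps[-(r - 1):]))
    return exps[:count]


if __name__ == "__main__":
    print(bound_exponents(r=4, k=2, count=7))
```

For r = 4 and k = 2 the exponents are 1, 1, 1, 3, 5, 9, 17. In this short run the exponent grows by a factor of at least 5 over any four consecutive subrounds, which is in line with needing fewer than r subrounds to match one round of the original algorithm.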



7.1 Simulations with Subtables 

We ran simulations for this parallel peeling algorithm with subtables with the same values of r, k, n and c
below the threshold. Table 5 shows the results. The number of subrounds is at most r times the number
of rounds in the original parallel peeling algorithm, but our analysis of Section 7 suggests the number of
subrounds should be significantly smaller. In this case, the factor is about 2 rather than r = 4.

We also ran simulations to check how closely the recursion tracks the number of vertices left after peeling the
$j$th subtable in the $i$th round. Denote the expected fraction of vertices left in the recursion by $\Lambda_{i,j}$, which is
given by the formula

$$\Lambda_{i,j} = \frac{1}{r}\Big(\sum_{h \le j} \lambda_{i,h} + \sum_{h>j} \lambda_{i-1,h}\Big),$$

where $\lambda_{0,j} = 1$ for all $j$. The results are presented in Table 6, where the prediction column is given by $\Lambda_{i,j}\, n$.
As can be seen, this recursion closely matches the number of vertices left in the simulation.

c = 0.7

i   j   Prediction   Experiment
1   1   942230       942230
1   2   876807       876803
1   3   801855       801855
1   4   714875       714878
2   1   678767       678771
2   2   643070       643080
2   3   609686       609697
2   4   581912       581919
3   1   554402       554414
3   2   527335       527341
3   3   500469       500476
3   4   472470       472475
4   1   442874       442871
4   2   410958       410956
4   3   375770       375764
4   4   336458       336447
5   1   292159       292144
5   2   242396       242374
5   3   187891       187866
5   4   131789       131776
6   1   80372        80376
6   2   40582        40600
6   3   15481        15503
6   4   3649         3666
7   1   348          354
7   2   6            6
7   3   0.003        0.008
7   4

Table 6: Results of simulations of peeling with subtables on how well the recursion approximates the number
of vertices left after t rounds. The experiments are run using r = 4, k = 2, n = 1 million, over 1000 trials.
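
The recursion above is straightforward to evaluate numerically. The following Python sketch iterates the survival probabilities subround by subround and reports, after each subround, n times the average over subtables of the most recent probability that a vertex still has degree at least k. The function names and the dictionary keyed by (round, subtable) are assumptions of this example, and the Poisson tail is computed naively from its finite complement.

```python
import math


def poisson_tail(mean, m):
    """Pr(Poisson(mean) >= m), computed as 1 minus the sum of the first m terms."""
    below = sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(m))
    return 1.0 - below


def predicted_vertices_left(n, c, r, k, rounds):
    """Return {(i, j): predicted vertices left after subround j of round i}."""
    rho = [1.0] * r  # rho[h]: most recent survival probability for subtable h
    lam = [1.0] * r  # lam[h]: most recent Pr(degree >= k) for subtable h
    out = {}
    for i in range(1, rounds + 1):
        for j in range(r):
            # Subtables before j already hold round-i values; those after j
            # still hold round-(i-1) values, as in the recursion.
            beta = r * c * math.prod(rho[h] for h in range(r) if h != j)
            rho[j] = poisson_tail(beta, k - 1)
            lam[j] = poisson_tail(beta, k)
            out[(i, j + 1)] = n * sum(lam) / r
    return out


if __name__ == "__main__":
    pred = predicted_vertices_left(n=1_000_000, c=0.7, r=4, k=2, rounds=7)
    for key in sorted(pred):
        print(key, round(pred[key], 3))
```

With n = 1 million, c = 0.7, r = 4 and k = 2, the first two values come out at approximately 942230 and 876807, in line with the Prediction column of Table 6.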



8 Conclusion 

In this paper, we analyzed parallel versions of the peeling process on random hypergraphs. We showed
that when the number of edges is below the threshold edge density for the k-core to be empty, with high
probability the parallel algorithm takes $O(\log\log n)$ rounds to peel the k-core to empty. In contrast, when the
number of edges is above the threshold, with high probability it takes $\Omega(\log n)$ rounds for the algorithm to
terminate with a non-empty k-core. We also considered some of the details of implementation and proposed
a variant of the parallel algorithm that avoids some implementation issues. Our experiments confirm our
theoretical results and show that in practice, peeling in parallel provides a considerable increase in efficiency
over the serialized version.

We believe this work spawns many open questions for further research. Our simulations suggest that the
number of rounds depends on the values k, r and c. In particular, when c is close to the threshold density $c^*_{k,r}$,
it takes considerably more rounds for the algorithm to terminate. A natural next step would be to analyze
the growth of the number of rounds also as a function of $c^*_{k,r} - c$ or $c - c^*_{k,r}$.

A much more careful study of the issues arising in the implementation of parallel peeling algorithms
would also be interesting. For example, in our IBLT implementation, two adjacent threads in the insertion
or recovery process may make wildly divergent memory accesses, because each item is hashed to r cells that
are deliberately spread randomly throughout the IBLT. It may be possible to modify the hashing technique
in a way that enables a higher degree of memory locality without significantly reducing the probability of
successful recovery. More generally, there may exist complicated unexplored tradeoffs between the locality
of memory accesses, space requirements of the data structure, and the degree of parallelism achievable.